Abstract

Feature selection is an important preprocessing step in machine learning and
data mining. In real-world applications, acquiring features incurs costs such as
money, time and other resources. In some cases, limited resources impose a test
cost constraint, and one must deliberately select an informative and cheap
feature subset for classification. This paper proposes the feature selection
with test cost constraint problem to address this issue. The new problem has a
simple form when described as a constraint satisfaction problem (CSP).
Backtracking is a general algorithm for CSPs, and it is efficient in solving the new
problem on medium-sized data. As the backtracking algorithm is not scalable
to large datasets, a heuristic algorithm is also developed. Experimental results
show that the heuristic algorithm can find the optimal solution in most cases.
We also redefine some existing feature selection problems in rough sets,
especially in decision-theoretic rough sets, from the viewpoint of CSP. These new
definitions provide insight into some new research directions.

Keywords: Feature selection, Cost-sensitive learning, Constraint satisfaction 
problem, Backtracking algorithm, Heuristic algorithm, Decision-theoretic 
rough sets. 



1. Introduction 

Many data mining approaches employ feature selection techniques to speed 
up learning and to improve model quality [T5J [2TJ |S7| . These techniques are 
especially important for datasets with tens or hundreds of thousands of features 
[TT] . Attribute reduction [33] is a special type of feature selection problems 
studied by the rough set society. A reduct is a feature subset that is jointly 
sufficient and individually necessary to preserve certain information of the data 
[60]. For decision making, the most often addressed information is the positive
region with respect to the decision class [43]. The objective of the classical
reduct problem is to find a minimal reduct [47], since simpler representation



* Corresponding author. Tel.: +86 133 7690 8359 
Email addresses: minfanphd@163.com (Fan Min), huqinghua@hit.edu.cn (Qinghua Hu),
williamfengzhu@gmail.com (William Zhu)



Preprint submitted to Elsevier 



September 26, 2012 



often provides better generalization ability according to Occam's razor principle. 
Other feature selection problems aim at finding feature subsets with maximal 
margin [4], maximal stability [1], minimal space [30], etc.

Most of these problems assume the data are already stored in datasets and 
available without charge. However, data are not free in real-world applications. 
There are test costs, such as money, time, or other resources [31] [52], to obtain
feature values of objects. For example, it takes both time and money to obtain 
medical data of a patient [64]. Under this context, one would like to select 
the cheapest reduct [51]. This consideration and the parallel test assumption 
have motivated the minimal test cost reduct (MTR) problem [31]. Recently,
a number of algorithms have been developed to deal with this problem (see, 
e.g., [T31 [3U H2]). Other related issues have also been identified in addressing 
numerical features [65], observational errors [59], and test cost relationships
[14] [32]. All these problems aim to find the cheapest feature subset that
preserves sufficient information for classification.

Nevertheless, the available resource is usually limited, and users have to 
sacrifice necessary information to keep the test cost under budget. This paper 
introduces the feature selection with test cost constraint (FSTC) problem to 
formulate this issue. The upper bound of the available resource serves as the 
constraint. The FSTC problem is more general than MTR [31]. In fact, these 
two problems coincide when the constraint is no less than the test cost of the 
optimal reduct. If the constraint is so tight that the sufficiency condition cannot 
be met, then one cannot obtain a reduct. This is why the new problem falls in 
feature selection instead of in attribute reduction. 

In this paper, the FSTC problem is defined from the viewpoint of the con- 
straint satisfaction problem (CSP). In other words, it is defined with four as- 
pects, namely input, output, constraint, and optimization objective. The new 
definition is simpler and easier to comprehend than the one defined from the 
viewpoint of set family [38]. Furthermore, we redefine the classical reduct problem
and the minimal reduct problem [47] from the CSP viewpoint. We show
that most feature selection problems in rough sets, including those of decision- 
theoretic rough sets (DTRS) [35J [57J [SHI EH1 EH] , can be viewed as extensions of 
the minimal reduct problem [57] from one or more of these four aspects. This 
viewpoint gives insight into meaningful research trends concerning feature selection
in a broader sense. In fact, there are some similar viewpoints, including
the optimization viewpoint of attribute reduction on DTRS discussed by Jia et
al. [20]. Compared with them, the one presented here is more systematic.

We develop a backtracking algorithm for the FSTC problem on small and
medium-sized datasets. Backtracking algorithms are natural and effective ap- 
proaches to CSPs for obtaining one or all optimal solutions. However, they are 
seldom employed to deal with feature selection problems in rough set theory 
(see, e.g., [3, 373132]), where discernibility matrix based approaches are more
popular (see, e.g., [J5J [37J [55J [S3]). One possible reason is that people have 
not defined attribute reduction problems explicitly as CSPs. As an exhaustive 
algorithm, the backtracking algorithm has a time complexity exponential with 
respect to the number of features. 
We also develop a heuristic algorithm with polynomial time complexity for
large datasets. We employ the addition-deletion approach [ST] to design a
heuristic function based on information gain, which is often employed in similar
problems [51 SSI [531 [5TJ. It is similar to the one proposed in [31] in that it prefers
low cost features through λ-weighting, where λ is a user-specified parameter. The
difference between the new algorithm and the one employed in [31] lies in the stopping
criteria. To improve the performance of the algorithm, we employ the competition
strategy [UJ. With this strategy, different feature subsets are obtained
by setting different λ values, then the best one is selected. This strategy
trades run time for the quality of the result. More importantly, with
this strategy, the user is not involved in the setting of λ. Instead, a set of λ
values which are valid for any dataset is specified by the algorithm.

Four open datasets are employed to study the performance of our algorithms.
Experimental results show that the backtracking algorithm is efficient
for medium-sized data. It takes less than 0.4 seconds to obtain an optimal feature
subset for the Mushroom dataset, which contains 22 features and 8124 objects.
The backtracking algorithm is approximately 10 times faster than SESRA [38],
which is based on another definition of the problem. The heuristic algorithm
is consistently more efficient than the backtracking one. With the help of the
competition strategy, the heuristic algorithm can find the optimal solution in most
cases.

The rest of the paper is organized as follows. Section 2 presents the problem
definition. The classical reduct problem and the minimal test cost reduct
problem are also redefined. Section 3 proposes both backtracking and heuristic
algorithms. Experimental results on four UCI (University of California, Irvine)
datasets are discussed in Section 4. Then Section 5 studies existing feature
selection problems in the rough set society from the viewpoint of CSP. Some
interesting new problems are also briefly discussed. Finally, Section 6 presents
the concluding remarks and further research directions.

2. Problem definition 

This section reviews three feature selection problems in rough sets. Two of 
them are under the classical rough sets [J3], and the last one is concerned with 
test cost [31]. These problems are redefined as CSPs. Moreover, we propose a
new problem called feature selection with test cost constraint. 

2.1. Classical feature selection problems in rough sets 

Data models are fundamental for feature selection. This paper only considers 
decision systems. 

Definition 1. [58] A decision system (DS) S is the 5-tuple:

$S = (U, C, d, V = \{V_a \mid a \in C \cup \{d\}\}, I = \{I_a \mid a \in C \cup \{d\}\})$, (1)

where $U$ is a finite set of objects called the universe, $C$ is the set of features, $d$ is
the decision class, $V_a$ is the set of values for each $a \in C \cup \{d\}$, and $I_a : U \to V_a$
is an information function for each $a \in C \cup \{d\}$.
Let the decision system $S = (U, C, d, V, I)$ be nominal, that is, all features
in $C$ are nominal. Any $\emptyset \neq B \subseteq C \cup \{d\}$ determines an indiscernibility relation
$I(B)$ on $U$. A partition determined by $B$ is denoted by $U/I(B)$, or $U/B$ for
brevity. Let $\underline{B}(X)$ denote the $B$-lower approximation of $X$. The positive region
of $\{d\}$ with respect to $B \subseteq C$ is defined as $POS_B(\{d\}) = \bigcup_{X \in U/\{d\}} \underline{B}(X)$.

Definition 2. Any $B \subseteq C$ is called a decision relative reduct (or a reduct
for short) of $S$ iff:

1. $POS_B(\{d\}) = POS_C(\{d\})$; and

2. $\forall a \in B$, $POS_{B - \{a\}}(\{d\}) \subset POS_C(\{d\})$.

Definition 2 indicates that a reduct is 1) jointly sufficient and 2) individually
necessary for preserving a particular property (positive region in this context)
of the decision system [231 SSI [HOI IHS]. In other words, there are two constraints,
named sufficiency and necessity, respectively. Consequently, the problem of
obtaining one reduct can be defined in the CSP style as follows.

Problem 3. The attribute reduction problem.
Input: $S = (U, C, d, V, I)$;
Output: $B \subseteq C$;

Constraints: (1) $POS_B(\{d\}) = POS_C(\{d\})$;
(2) $\forall a \in B$, $POS_{B - \{a\}}(\{d\}) \subset POS_C(\{d\})$.
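
The two constraints of Problem 3 can be checked mechanically on small data. The following is a minimal sketch, assuming a decision system stored as a list of tuples whose last column is the decision; the helper names `positive_region` and `is_reduct` are illustrative and do not come from the paper. The data are those of Table 1 (Example 9), with features indexed from 0.

```python
from collections import defaultdict

# Decision table of Example 9 (Table 1): columns a1, a2, a3, d for x1..x5.
ROWS = [
    ("Y", "Y", "Y", "A"),
    ("N", "Y", "N", "B"),
    ("Y", "N", "N", "B"),
    ("N", "N", "Y", "A"),
    ("Y", "Y", "Y", "B"),
]
DECISION = 3  # column index of d (an assumption of this sketch)


def positive_region(rows, features):
    """Indices of objects whose equivalence class under `features` has one decision."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[f] for f in features)].append(i)
    pos = set()
    for members in blocks.values():
        if len({rows[i][DECISION] for i in members}) == 1:
            pos.update(members)
    return pos


def is_reduct(rows, subset, all_features):
    """Check sufficiency (1) and necessity (2) with respect to the positive region."""
    full = positive_region(rows, all_features)
    if positive_region(rows, subset) != full:  # constraint (1)
        return False
    # Constraint (2): the positive region is monotone in the feature set,
    # so "not equal to the full region" is the same as "proper subset".
    return all(
        positive_region(rows, [f for f in subset if f != a]) != full
        for a in subset
    )


print(sorted(positive_region(ROWS, [0, 1])))
print(is_reduct(ROWS, [0, 1], [0, 1, 2]))
```

On this table the subset {a1, a2} satisfies both constraints, whereas the full feature set fails the necessity constraint because a3 can be dropped.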

There may exist many reducts for a decision system. Let the set of all
relative reducts of $S$ be $Red(S)$. Any $R \in Red(S)$ is a minimal reduct if and
only if $|R|$ is minimal. Minimal reducts are preferred because they provide the
simplest representation of the knowledge. The problem of finding a minimal
reduct is called the minimal reduct problem, defined as follows.

Problem 4. The minimal reduct problem.
Input: $S = (U, C, d, V, I)$;
Output: $B \subseteq C$;

Constraint: $POS_B(\{d\}) = POS_C(\{d\})$;
Optimization objective: $\min |B|$.

Problem 4 has an optimization objective, which is typical in CSP. Note
that there is only one constraint, namely sufficiency. This does not indicate
that the necessity constraint is not met. In fact, the necessity constraint can be
derived from the optimization objective. One can easily prove this by contradiction.
That is, if there are superfluous features, the size of the feature subset
cannot be minimal. In other words, the problem definition is simplified when
viewed as a CSP.
2.2. Feature selection minimizing test cost

Test cost is an important issue in many applications. We have built a hierarchy
of six test-cost-sensitive decision systems [35]. Here we present a simple
model which will be used in defining the new problem of this paper.

Definition 5. [32] A test-cost-independent decision system (TCI-DS) S is the
6-tuple:

$S = (U, C, d, \{V_a \mid a \in C \cup \{d\}\}, \{I_a \mid a \in C \cup \{d\}\}, c)$, (2)

where $U$, $C$, $d$, $\{V_a\}$, and $\{I_a\}$ have the same meanings as in Definition 1, and
$c : C \to \mathbb{R}^+ \cup \{0\}$ is the test cost function. Test costs are independent of one
another, that is, $c(B) = \sum_{a \in B} c(a)$ for any $B \subseteq C$.

The minimal test cost reduct (MTR) problem proposed in [31] can be redefined
as follows.

Problem 6. The minimal test cost reduct problem.
Input: $S = (U, C, d, V, I, c)$;
Output: $B \subseteq C$;

Constraint: $POS_B(\{d\}) = POS_C(\{d\})$;
Optimization objective: $\min c(B)$.

One can see there are two differences between Problem 6 and Problem 4.
The first is the input, where the test cost is the external information. The
second is the optimization objective, which is to minimize the test cost, instead
of the number of features.

2.3. Feature selection with test cost constraint 

Sometimes we are given limited resources to obtain the feature values. We 
proposed the issue of optimal sub-reduct in [35J [35] to address this issue. Here 
we use the positive region instead of the conditional information entropy to 
define respective concepts. 

Definition 7. Let $S = (U, C, d, V, I, c)$ be a TCI-DS and $m$ the test cost upper
bound. The set of all feature subsets subject to the constraint is

$T(S, m) = \{B \subseteq C \mid c(B) \le m\}$. (3)

In $T(S, m)$, the set of all feature subsets with the maximal positive region is

$M_T(S, m) = \{B \in T(S, m) \mid POS_B(\{d\}) = \max\{POS_{B'}(\{d\}) \mid B' \in T(S, m)\}\}$. (4)

In $M_T(S, m)$, the set of all optimal sub-reducts is

$P_{M_T}(S, m) = \{B \in M_T(S, m) \mid c(B) = \min\{c(B') \mid B' \in M_T(S, m)\}\}$. (5)

Any element in $P_{M_T}(S, m)$ is called an optimal sub-reduct with test cost constraint,
or an optimal sub-reduct for brevity.
In Definition 7, Equation (3) ensures the constraint is met; Equation (4) ensures
the most informative feature subset is selected; and Equation (5) ensures the test
cost is minimized. The problem of constructing $P_{M_T}(S, m)$ is called the optimal
sub-reducts with test cost constraint (OSRT) problem [35] [38]. Unfortunately,
the definition is rather lengthy and hard to read. Next we follow the style of
Problem 4 to present the following problem.

Problem 8. The feature selection with test cost constraint (FSTC) problem.
Input: $S = (U, C, d, V, I, c)$, the test cost upper bound $m$;
Output: $B \subseteq C$;
Constraint: $c(B) \le m$;

Optimization objectives: (1) $\max POS_B(\{d\})$; and (2) $\min c(B)$.

Note that the two objectives are not equally important. They are the primary
and the secondary objectives, respectively. In fact, Problem 8 is the same
as the OSRT problem. However, the problem definition is simpler and easier to
comprehend. This indicates that the form of CSP is more appropriate
for this kind of problem.

By comparing Problems 6 and 8, we observe the following differences. First,
the constraint is expressed by the test cost instead of the positive region. Second,
the first objective of Problem 8 is to maximize the positive region. Third, the
objective of Problem 6 becomes the secondary objective of Problem 8. This
objective is considered after the primary one is achieved.

In fact, Problem 8 is more general than Problem 6. Let $B'$ be a minimal
test cost reduct subject to Problem 6. If $m \ge c(B')$, the constraint is met when
the primary objective is achieved. In other words, the constraint is essentially
redundant. The first objective will be replaced by $POS_B(\{d\}) = POS_C(\{d\})$,
which serves as a constraint. The second objective is then the only objective.
Consequently, Problem 8 coincides with Problem 6 in this case.
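
The relations among Definition 7, Problem 6 and Problem 8 can be checked by exhaustive enumeration on small data. The sketch below is illustrative only: the function names (`solve_fstc`, `solve_osrt`, `solve_mtr`) are hypothetical, positive regions are compared by cardinality, and the data and costs are those of Example 9 (Table 1, c = [2, 3, 10]).

```python
from collections import defaultdict
from itertools import combinations

# Table 1 (Example 9): columns a1, a2, a3, d; the last column is the decision.
ROWS = [
    ("Y", "Y", "Y", "A"),
    ("N", "Y", "N", "B"),
    ("Y", "N", "N", "B"),
    ("N", "N", "Y", "A"),
    ("Y", "Y", "Y", "B"),
]
COSTS = (2, 3, 10)


def pos_size(rows, features):
    """Size of the positive region of {d} with respect to `features`."""
    decisions = defaultdict(set)
    for row in rows:
        decisions[tuple(row[f] for f in features)].add(row[-1])
    return sum(len(decisions[tuple(row[f] for f in features)]) == 1 for row in rows)


def all_subsets(n):
    for k in range(n + 1):
        yield from combinations(range(n), k)


def cost(subset, costs):
    return sum(costs[f] for f in subset)


def solve_fstc(rows, costs, m):
    """Problem 8: maximise the positive region first, then minimise cost, with c(B) <= m."""
    feasible = [b for b in all_subsets(len(costs)) if cost(b, costs) <= m]
    return min(feasible, key=lambda b: (-pos_size(rows, b), cost(b, costs)))


def solve_osrt(rows, costs, m):
    """Definition 7: build T, then M_T, then P_MT as three successive filters."""
    t = [b for b in all_subsets(len(costs)) if cost(b, costs) <= m]
    best_pos = max(pos_size(rows, b) for b in t)
    m_t = [b for b in t if pos_size(rows, b) == best_pos]
    best_cost = min(cost(b, costs) for b in m_t)
    return [b for b in m_t if cost(b, costs) == best_cost]


def solve_mtr(rows, costs):
    """Problem 6: the cheapest subset that preserves the full positive region."""
    n = len(costs)
    full = pos_size(rows, range(n))
    sufficient = [b for b in all_subsets(n) if pos_size(rows, b) == full]
    return min(sufficient, key=lambda b: cost(b, costs))


print(solve_fstc(ROWS, COSTS, 6), solve_osrt(ROWS, COSTS, 6), solve_mtr(ROWS, COSTS))
```

With m = 6 all three formulations select {a1, a2}. With m = 4 no reduct is affordable, and the FSTC solution degenerates to the empty subset, which illustrates why the problem belongs to feature selection rather than attribute reduction.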

3. Algorithm design 

This section presents two algorithms. One is a backtracking algorithm, the
other is a heuristic algorithm. The backtracking algorithm always produces an
optimal solution to the problem. The heuristic algorithm is more efficient on
large datasets; however, the feature subset obtained may not be optimal.

3.1. The backtracking algorithm 

The backtracking algorithm is a natural solution to CSP. In the rough set
society, people seldom employ this algorithm for attribute reduction. This is
partly due to the form of problem definition as shown in Definition 2. The
backtracking algorithm for the FSTC problem is illustrated in Algorithm 1. To
invoke the algorithm, one should initialize the global variable $m$, let $B = \emptyset$,
and use the following statement:
backtracking($\emptyset$, 0);
Algorithm 1 The backtracking algorithm to the FSTC problem
Input: Selected feature subset $B'$, feature index lower bound $l$
Output: Results are stored in the global variable $B$
Method: backtracking

1: for ($i = l$; $i < |C|$; $i$++) do
2:   $B'' = B' \cup \{a_i\}$; //One more feature
3:   if ($c(B'') > m$) then
4:     continue; //The constraint is violated
5:   end if
6:   if ($POS_{B''}(\{d\}) = POS_C(\{d\})$) then
7:     throw new Exception("Coincides with the MTR problem");
8:   end if
9:   if ($|POS_{B''}(\{d\})| > |POS_B(\{d\})| \vee ((POS_{B''}(\{d\}) = POS_B(\{d\})) \wedge (c(B'') < c(B)))$) then
10:    $B = B''$; //A better feature subset
11:  end if
12:  backtracking($B''$, $i + 1$); //Backtracking
13: end for



then at the end of the algorithm execution, an optimal feature subset will be 
stored in B. 

In Algorithm 1, Lines 3 through 5 check the constraint. Feature subsets
violating the constraint are simply discarded. Lines 6 through 8 indicate that if the
positive region of the current feature subset is the same as that of $C$, namely the
sufficiency condition can be met, the FSTC problem coincides with the MTR
problem. In this case we only need to address the MTR problem. Lines 9 through
11 are devoted to the optimization objectives. $|POS_{B''}(\{d\})| > |POS_B(\{d\})|$
serves the first objective. $c(B'') < c(B)$ serves the second; it is checked
only if $POS_{B''}(\{d\}) = POS_B(\{d\})$. Our implementation in Coser [40]
avoids repeated computation of positive regions.
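
The recursion of Algorithm 1 can be sketched in a few lines. This is an illustrative reading rather than the Coser implementation: the name `solve_by_backtracking` is hypothetical, the global variable B is replaced by a local dictionary, positive regions are compared by cardinality only, the exception of Lines 6 through 8 is omitted, and positive regions are recomputed from scratch at every step. The data are those of Example 9 (Table 1) with c = [2, 3, 10].

```python
from collections import defaultdict

# Table 1 (Example 9): columns a1, a2, a3, d; the last column is the decision.
ROWS = [
    ("Y", "Y", "Y", "A"),
    ("N", "Y", "N", "B"),
    ("Y", "N", "N", "B"),
    ("N", "N", "Y", "A"),
    ("Y", "Y", "Y", "B"),
]
COSTS = (2, 3, 10)


def pos_size(rows, features):
    """Size of the positive region of {d} with respect to `features`."""
    decisions = defaultdict(set)
    for row in rows:
        decisions[tuple(row[f] for f in features)].add(row[-1])
    return sum(len(decisions[tuple(row[f] for f in features)]) == 1 for row in rows)


def solve_by_backtracking(rows, costs, m):
    # Best subset found so far; starts from the empty subset.
    best = {"subset": (), "pos": pos_size(rows, ()), "cost": 0, "calls": 0}

    def backtracking(selected, lower):
        best["calls"] += 1
        for i in range(lower, len(costs)):
            candidate = selected + (i,)          # one more feature
            c = sum(costs[f] for f in candidate)
            if c > m:                            # the constraint is violated
                continue
            p = pos_size(rows, candidate)
            if p > best["pos"] or (p == best["pos"] and c < best["cost"]):
                best.update(subset=candidate, pos=p, cost=c)
            backtracking(candidate, i + 1)       # features are never removed

    backtracking((), 0)
    return best


result = solve_by_backtracking(ROWS, COSTS, 6)
print(result["subset"], result["pos"], result["cost"], result["calls"])
```

Because the index lower bound only increases, each feature subset is generated at most once, so the number of invocations cannot exceed the number of subsets.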

Note that a feature is never removed from a subset. This is important to
ensure the correctness of the algorithm. Line 2 shows that feature $a_i$ is added. It
may happen that $POS_{B''}(\{d\}) = POS_{B'' - \{a_i\}}(\{d\})$, i.e., $a_i$ does not contribute
to the positive region. However, $a_i$ is not removed because it may be useful when
combined with other features. We introduce the following example to explain
the reason.

Example 9. Consider the decision system listed in Table 1. Let $c = [2, 3, 10]$
and $m = 6$. Because $c(a_3) = 10 > m$, $a_3$ is never selected. We have
$POS_{\{a_1\}}(\{d\}) = POS_{\{a_2\}}(\{d\}) = \emptyset$. That is, neither $a_1$ nor $a_2$ contributes to
the positive region alone. However, $POS_{\{a_1, a_2\}}(\{d\}) = \{x_2, x_3, x_4\}$, hence both
$a_1$ and $a_2$ are useful. The optimal feature subset is $\{a_1, a_2\}$, which is the output
of the algorithm.

In fact, $B$ may contain some redundant features during the algorithm execution.
It will eventually be replaced by another feature subset with bigger positive
Table 1: A decision table for Example 9

U     a_1   a_2   a_3   d
x_1   Y     Y     Y     A
x_2   N     Y     N     B
x_3   Y     N     N     B
x_4   N     N     Y     A
x_5   Y     Y     Y     B

region or smaller test cost in Line 10. Example 9 will be discussed further in
Section 3.2.

The space complexity of Algorithm 1 is easy to analyze. The algorithm
searches in a tree with depth $|C|$ in a depth-first manner. Whenever the backtracking
method is invoked there is a need to obtain a new partition of the
objects, which takes $O(|U| \times |C|)$ space. Hence the space complexity is

$O(|C| \times |U| \times |C|) = O(|U| \times |C|^2)$. (6)

Now we analyze the time complexity. The number of feature subsets is $2^{|C|}$.
In the worst case all of them are checked. On the other hand, a feature subset
is never checked twice. Therefore the number of backtracking steps, namely
the number of times the backtracking method is invoked, is bounded by $2^{|C|}$.
As indicated by Line 1, each time we need to compute a feature subset with
one more feature. In this way, the computation involves splitting the dataset
according to the current feature. The respective operation takes $O(|U| \times |V_{a_i}|)$
time. Let $v_{max} = \max_{a \in C} |V_a|$. The time complexity is

$O(|U| \times 2^{|C|} \times v_{max})$. (7)

Unfortunately, the average time complexity is hard to analyze. We will show
by experimentation that it is significantly lower than the worst case.

The design of the algorithm is often closely related to the problem definition.
Algorithm 1 can be easily obtained from Problem 8. Similarly, the SESRA algorithm
[38] has three main steps, as indicated by Definition 7. This
further shows the influence of the problem viewpoint on the problem definition
and the algorithm design.

3.2. The heuristic algorithm 

The backtracking algorithm is not scalable. As indicated by Equation (7),
the run time can be exponential with respect to the number of features in the 
worst case. Hence we need to design heuristic algorithms for large datasets. 
We adopt the well known addition-deletion approach [331 [ST] to design our 
algorithm, since the deletion approach is inefficient for large datasets [5Tj . 

The positive region seems to be a natural heuristic information; however, it
may not work on some datasets. Let $B$ be the currently selected feature subset.
We would like to select $a_i \in C - B$ if it is informative (i.e., $|POS_{B \cup \{a_i\}} - POS_B|$
is big) and cheap (i.e., $c(a_i)$ is small). Unfortunately, we have counterexamples
to this approach. Let us consider Example 9 again. At the very beginning
$B = \emptyset$. Since $POS_{B \cup \{a_1\}} = \emptyset$, $a_1$ has no contribution to the positive region
and therefore cannot be selected. For the same reason $a_2$ is not selected. $a_3$
cannot be selected due to the test cost constraint. Finally, this approach fails to
construct the optimal feature subset $\{a_1, a_2\}$. Such cases happen in applications
frequently. We have tested this approach on four datasets listed in [5]. In the
Voting and Tic-tac-toe datasets [2], no feature alone produces positive region,
therefore the approach fails given any test cost setting.

A feasible heuristic information is the information gain [M]. Generally,
a feature subset with less information entropy tends to produce bigger positive
region. Therefore we employ information gain in this paper to design our algorithm.
Let $H(Q \mid P)$ be the conditional information entropy of $Q$ w.r.t. $P$ [54].
Let further $B \subseteq C$ and $a_i \in C - B$; the information gain of $a_i$ w.r.t. $B$ is

$f_e(B, a_i) = H(\{d\} \mid B) - H(\{d\} \mid B \cup \{a_i\})$. (8)

It is proven that $|POS_{B \cup \{a_i\}} - POS_B| > 0$ gives $H(\{d\} \mid B) - H(\{d\} \mid B \cup \{a_i\}) > 0$.
But the converse does not hold. In other words, information entropy is more
sensitive to features than the positive region.
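
The sensitivity of information gain can be observed on the data of Example 9. The following is a minimal sketch, assuming base-2 logarithms and a decision stored in the last column; the names `conditional_entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter, defaultdict

# Table 1 (Example 9): columns a1, a2, a3, d; the last column is the decision.
ROWS = [
    ("Y", "Y", "Y", "A"),
    ("N", "Y", "N", "B"),
    ("Y", "N", "N", "B"),
    ("N", "N", "Y", "A"),
    ("Y", "Y", "Y", "B"),
]


def conditional_entropy(rows, features):
    """H({d} | B): entropy of the decision inside each block of U/B, weighted by block size."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[f] for f in features)].append(row[-1])
    h = 0.0
    for labels in blocks.values():
        for count in Counter(labels).values():
            # (|block| / |U|) * p * log2(p), with p = count / |block|
            h -= count / len(rows) * math.log2(count / len(labels))
    return h


def information_gain(rows, selected, a):
    """f_e(B, a) = H({d} | B) - H({d} | B + {a})."""
    return conditional_entropy(rows, selected) - conditional_entropy(rows, selected + [a])


print(information_gain(ROWS, [], 0))      # gain of a1 on the full table
print(information_gain(ROWS[:4], [], 0))  # gain of a1 when x5 is removed
```

On the full table a1 has a small positive gain although it produces an empty positive region by itself, so an entropy-based heuristic can still rank it. Once x5 is removed the gain drops to zero.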

To select the current best feature, both information gain [53] and test cost
are taken into consideration. We use the same approach as that in [31] to select
the current best test. The λ-weighted function is defined as

$f(B, a_i, c) = f_e(B, a_i) \, c(a_i)^{\lambda}$, (9)

where $\lambda$ is a non-positive number. With the introduction of $\lambda$, cheaper features
are preferred. If $\lambda = 0$, $f(B, a_i, c) = f_e(B, a_i)$, and the heuristic information
coincides with the information gain.
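
A small numeric illustration of the weighting in Equation (9) follows. It is a sketch under the reading that the information gain is multiplied by the test cost raised to the non-positive exponent λ; the function name `weighted_gain` and the sample numbers are hypothetical.

```python
def weighted_gain(gain, cost, lam):
    """Information gain times cost**lam, with lam <= 0 (reading of Equation (9))."""
    if lam > 0:
        raise ValueError("lambda must be non-positive")
    if cost <= 0 and lam < 0:
        raise ValueError("a positive test cost is assumed when lambda < 0")
    return gain * cost ** lam


# With lam = 0 the weighting disappears and only the gain matters.
print(weighted_gain(0.5, 4, 0))
# With lam = -1 a cheap feature with a smaller gain can outrank an expensive one.
print(weighted_gain(0.4, 2, -1), weighted_gain(0.5, 10, -1))
```

The more negative the exponent, the more strongly expensive features are penalised, which is why different exponents can lead to different feature subsets.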

Our algorithm is listed in Algorithm 2. The algorithm first constructs a
feature subset meeting the constraint and with minimal information entropy
in Lines 4 through 19. Lines 14 through 18 are not necessary; however, they
help speed up the algorithm. Then redundant features are removed from the
viewpoint of the positive region in Lines 20 through 24.

One may find that the algorithm is successful on Example 9. If we remove $x_5$
from the dataset, however, this algorithm also fails. To make matters worse, the ID3
decision tree encounters the same problem. This might be a drawback of heuristic
algorithms compared with exhaustive ones. Fortunately, this extreme case
seldom happens in applications. On many UCI datasets we tested, Algorithm
2 never fails to construct a feature subset.
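
The behaviour described above can be reproduced with a compact sketch of the addition-deletion scheme of Algorithm 2. It is an illustrative reading, not the authors' code: the name `heuristic_fstc` is hypothetical, entropies are recomputed from scratch, equality of entropies is tested with a small tolerance, and test costs are assumed to be positive.

```python
import math
from collections import Counter, defaultdict

# Table 1 (Example 9): columns a1, a2, a3, d; the last column is the decision.
ROWS = [
    ("Y", "Y", "Y", "A"),
    ("N", "Y", "N", "B"),
    ("Y", "N", "N", "B"),
    ("N", "N", "Y", "A"),
    ("Y", "Y", "Y", "B"),
]
COSTS = (2, 3, 10)


def _blocks(rows, features):
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[f] for f in features)].append(row[-1])
    return blocks.values()


def cond_entropy(rows, features):
    h = 0.0
    for labels in _blocks(rows, features):
        for count in Counter(labels).values():
            h -= count / len(rows) * math.log2(count / len(labels))
    return h


def pos_size(rows, features):
    return sum(len(labels) for labels in _blocks(rows, features) if len(set(labels)) == 1)


def heuristic_fstc(rows, costs, m, lam):
    selected, budget = [], m
    candidates = [a for a in range(len(costs)) if costs[a] <= budget]
    while candidates:
        base = cond_entropy(rows, selected)
        # Addition: the affordable feature with the largest weighted gain.
        best = max(
            candidates,
            key=lambda a: (base - cond_entropy(rows, selected + [a])) * costs[a] ** lam,
        )
        selected.append(best)
        candidates.remove(best)
        budget -= costs[best]
        # Deletion: drop features that do not change the conditional entropy.
        for a in list(selected):
            rest = [f for f in selected if f != a]
            same = math.isclose(
                cond_entropy(rows, rest), cond_entropy(rows, selected), abs_tol=1e-12
            )
            if same:
                selected, budget = rest, budget + costs[a]
        # Speed-up: discard candidates that no longer fit the remaining budget.
        candidates = [a for a in candidates if costs[a] <= budget]
    # Final deletion from the viewpoint of the positive region.
    for a in list(selected):
        rest = [f for f in selected if f != a]
        if pos_size(rows, rest) == pos_size(rows, selected):
            selected = rest
    return sorted(selected)


print(heuristic_fstc(ROWS, COSTS, 6, 0.0))      # full table of Example 9
print(heuristic_fstc(ROWS[:4], COSTS, 6, 0.0))  # the same table without x5
```

On the full table the sketch returns {a1, a2}. Without x5, every affordable feature has zero gain when considered alone, is deleted right after being added, and the result is the empty subset.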

The space complexity of Algorithm 2 is decided by the size of the decision
system. It is

$O(|U| \times |C|)$. (10)

Now we analyze the time complexity. In the worst case, the while loop indicated 
by Line 4 would execute |C| times, and each time all remaining features are 
Algorithm 2 The λ-weighted heuristic algorithm
Input: $S = (U, C, d, V, I, c)$, $m$
Output: $B \subseteq C$
Method: λ-weighted-fstc

1: $B = \emptyset$; //initialize the output
2: $CA = C$; //unprocessed features
3: $c_l = m$; //available test cost
//Compute a feature subset with the least information entropy
4: while ($CA \neq \emptyset$) do
5:   For any $a \in CA$ satisfying $c(a) \le c_l$, compute $f(B, a, c)$;
     //Addition
6:   Select $a' \in CA$ with the maximal $f(B, a', c)$;
7:   $B = B \cup \{a'\}$; $CA = CA - \{a'\}$; $c_l = c_l - c(a')$;
     //Deletion, remove redundant features from the viewpoint of information entropy
8:   for (each $a \in B$) do
9:     if ($H(\{d\} \mid B - \{a\}) = H(\{d\} \mid B)$) then
10:      $B = B - \{a\}$; //a is redundant
11:      $c_l = c_l + c(a)$; //restore the constraint
12:    end if
13:  end for
     //Remove features not satisfying the constraint to speed up
14:  for (each $a \in CA$) do
15:    if ($c(a) > c_l$) then
16:      $CA = CA - \{a\}$;
17:    end if
18:  end for
19: end while
//Remove redundant features from the viewpoint of positive region
20: for (each $a \in B$) do
21:   if ($POS_{B - \{a\}}(\{d\}) = POS_B(\{d\})$) then
22:     $B = B - \{a\}$; //a is redundant
23:   end if
24: end for
25: return $B$;



checked in Line 5. Line 5 is executed at most $\sum_{i=0}^{|C|-1} (|C| - i) = O(|C|^2)$ times.
Since $f(B, a, c)$ is based on the positive region, similar to the analysis in Section
3.1, the time complexity is

$O(|U| \times |C|^2 \times v_{max})$. (11)

In applications, it is hard for the user or even the expert to set a rational λ.
To make matters worse, the best λ does not always produce the best result.
We can adopt the competition strategy, which works as follows. First, it specifies a
set of λ values; then it obtains the corresponding feature subsets using Algorithm
2; finally it chooses the feature subset with the maximal positive region and the
minimal test cost. Since feature subsets produced by different λ values compete
against each other with only one winner, this strategy is called the competition
strategy [31].

Formally, let $B_\lambda$ be the feature subset constructed by Algorithm 2 using the
exponent $\lambda$. With $\Lambda$ the set of user-specified $\lambda$ values,

$POS_\Lambda = \max_{\lambda \in \Lambda} POS_{B_\lambda}(\{d\})$ (12)

is the maximal positive region that can be obtained with the competition strategy.
This process requires the algorithm to be run $|\Lambda|$ times and the time
complexity would be $O(|\Lambda| \times |U| \times |C|^2 \times v_{max})$ instead. It is acceptable for
relatively small $|\Lambda|$. We will show that setting $\Lambda$ is easy in Section 4.3.
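
The competition itself is a simple selection loop. The sketch below assumes a caller-supplied function `run(lam)` that returns a feature subset together with the size of its positive region and its test cost; both `compete` and `run` are hypothetical names, and the stand-in `toy_run` only mimics a heuristic whose outcome depends on the exponent. The list of exponents follows the values 0, -0.25, ..., -3 used in the experiments.

```python
# Exponents 0, -0.25, -0.5, ..., -3 (13 values).
LAMBDAS = [-0.25 * k for k in range(13)]


def compete(run, lambdas):
    """Run the heuristic once per exponent and keep the single winner.

    `run(lam)` must return (subset, positive_region_size, test_cost).
    The winner has the largest positive region; ties go to the lower cost,
    and remaining ties go to the exponent tried first.
    """
    best_key, best_subset, best_lam = None, None, None
    for lam in lambdas:
        subset, pos, cost = run(lam)
        key = (-pos, cost)
        if best_key is None or key < best_key:
            best_key, best_subset, best_lam = key, subset, lam
    return best_subset, best_lam


def toy_run(lam):
    # Stand-in for the heuristic: strong weighting happens to find a better subset.
    if lam > -1:
        return ("a1",), 2, 5
    return ("a1", "a2"), 3, 7


print(compete(toy_run, LAMBDAS))
```

The run time grows linearly with the number of exponents, in line with the complexity stated above.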



4. Experiments 

The main purpose of our experiments is to answer the following questions. 

1. Is the backtracking algorithm efficient? 

2. Is the heuristic algorithm effective? 

3. Is there an optimal setting of λ for any dataset?

4. Is the extra computation time consumed by the competition strategy 
worthwhile? 



4.1. Datasets

We deliberately select four datasets from the UCI Repository of Machine
Learning Databases [2]. Their basic information is listed in Table 2, where $|C|$
is the number of features, $|U|$ is the number of instances, and $d$ is the name of
the decision.



Table 2: Dataset information

Name          Domain    |C|   |U|    d
Zoo           zoology   16    101    type
Voting        society   16    435    vote
Tic-tac-toe   game      9     958    class
Mushroom      botany    22    8124   classes

There are a number of notes to make. While counting the number of features, 
the decision is not included. Missing values (e.g., those appearing in the Voting 
dataset) are treated as one particular value. That is, ? is equal to itself, and 
unequal to any other value. The "animal name" feature is not useful in the Zoo 
dataset, and we simply remove it. 
Table 3: Backtracking steps on four datasets (with 100 test cost settings)

                                     |B|                backtracking steps
Dataset       |C|   2^|C|       min   max   av.     min     max      av.
Zoo           16    65,536      4     6     4.74    132     4,089    1,112
Voting        16    65,536      7     9     8.23    8,139   46,421   24,354
Tic-tac-toe   9     512         6     7     6.70    271     439      386
Mushroom      22    4,194,304   3     6     4.31    26      4,899    725



Table 4: Run time (ms) on four datasets (mean values for 100 test cost settings)

Dataset       SESRA   SESRA*   backtracking   heuristic
Zoo           50      48       7              2
Voting        5,334   2,498    485            18
Tic-tac-toe   167     39       28             26
Mushroom      3,661   857      367            180



Most datasets from the UCI library [2] do not provide test cost information.
For statistical purposes, we need to produce them. Different test cost distributions
correspond to different applications. Three distributions, namely the uniform
distribution, the normal distribution, and the Pareto distribution, have been discussed
in [31]. For simplicity, this paper only employs the uniform distribution to generate
random test costs in [1..100]. According to Definition 5, two TCI-DS are
different once their test cost settings are different. In this sense, we can produce
as many TCI-DS as needed from a given DS.
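
Generating such test cost settings is straightforward. The sketch below assumes integer costs drawn uniformly from 1 to 100 inclusive; the function name `random_test_costs` and the optional seed are illustrative choices, not part of the paper.

```python
import random


def random_test_costs(n_features, low=1, high=100, seed=None):
    """Draw one integer test cost per feature, uniformly from [low..high]."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n_features)]


# One test cost setting for a dataset with 16 features, e.g. Zoo or Voting.
costs = random_test_costs(16, seed=0)
print(costs)
```

Calling the function with 100 different seeds would yield 100 test cost settings, and therefore 100 different TCI-DS, for the same decision system.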

4.2. Efficiency of the algorithms

We need to know the efficiency of the backtracking algorithm from three
viewpoints. The first is the average time complexity. We need to know whether
or not the number of backtracking steps is exponential with respect to the
number of features. The second is the time taken for small or medium-sized data.
In fact, diagnosis data for one particular disease in a hospital may contain only
a few thousand instances. For those datasets, an optimal solution is always
required. The third is the run time compared with other exhaustive approaches.
The backtracking algorithm is compared with SESRA and SESRA* proposed
in [38]. SESRA is based on Definition 7, and SESRA* is an enhanced version.

Table 3 shows the number of backtracking steps, namely how many times
the backtracking method is invoked. Let BS denote this number. $2^{|C|}$ is the size
of the backtracking tree, hence it is also the upper bound of BS. For the Voting
dataset, $|C| = 16$ and sometimes $|B| = 9$. Therefore the maximal BS can be
46,421, which is close to $2^{|C|} = 65,536$. This indicates that sometimes BS can
be exponential with respect to $|C|$. In contrast, for the Mushroom dataset,
$|C| = 22$ and sometimes $|B| = 6$. The maximal BS is only 4,899, which is
significantly smaller than $2^{|C|} = 4,194,304$. In short, BS depends not only
on $|C|$, but also on $|B|$.

Table 4 compares the performance of the backtracking algorithm with SESRA
and SESRA* [35] in terms of the run time. The backtracking algorithm only
takes 367 ms and 485 ms on the Mushroom and Voting datasets, respectively.
In other words, it is appropriate for many real applications. Moreover, the
backtracking algorithm consistently outperforms SESRA and SESRA*. Only about
1/10 of the time is taken on the Tic-tac-toe and Mushroom datasets compared with
SESRA. These results further show the advantage of the CSP viewpoint.

For convenience, the run time of the heuristic algorithm is also listed in Table
4. The heuristic algorithm is always more efficient than exhaustive algorithms.
The efficiency difference becomes significant when the run time of exhaustive
algorithms is long. Moreover, the efficiency depends more on the dataset size
than on $|B|$. To sum up, the heuristic algorithm can deal with larger datasets
compared with exhaustive algorithms.



4.3. Effectiveness of the heuristic algorithm

We compare the performance of the three approaches mentioned in Section
3.2. All three are based on Algorithm 2. The first approach, called the non-weighting
approach, is implemented by setting $\lambda = 0$. The second approach,
called the best λ approach, chooses the best λ value in $\Lambda = \{0, -0.25, -0.5, \ldots,
-3\}$. The third approach is the competition strategy based on $\Lambda$ as discussed in
Section 3.2.

We now look at the influence of the λ setting. Fig. 1 shows the probability
of finding the optimal feature subset for given λ. Although −0.75 seems a
reasonable value, there does not exist an optimal setting of λ for all datasets.
In other words, λ is hard to specify.

General results are depicted in Fig. 2, from which we observe the following. 
First, the approach that does not take the test cost into consideration 
performs poorly. In most cases it cannot find the optimal feature subset. 
Second, if we specify λ appropriately, namely λ = λ*, the results are more 
acceptable. It is more likely to find the optimal feature subset. However, as 
discussed earlier, we often have no idea how to specify it. Third, the 
performance of the competition strategy is much better than that of the other 
two. In more than 70% of the cases it produces the optimal feature subset. 
Moreover, the user does not have to know the optimal setting of λ. In short, 
the extra computational resources consumed by the competition strategy are 
worthwhile. 
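
The competition strategy can be sketched as follows. This is a minimal illustration, not Algorithm 2 itself: the greedy addition loop, the score (positive-region gain multiplied by cost raised to λ), the toy decision table, the costs and the budget are all assumptions made for the example.

```python
from collections import defaultdict


def positive_region_size(rows, features, decision):
    """Count objects whose equivalence class on `features` has one decision."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[f] for f in features)].append(row[decision])
    return sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)


def weighted_greedy(rows, costs, decision, budget, lam):
    """Greedy addition under a cost budget (assumed score: gain * cost**lam)."""
    chosen, spent = [], 0
    while True:
        base = positive_region_size(rows, chosen, decision)
        best, best_score = None, 0.0
        for feature, cost in costs.items():
            if feature in chosen or spent + cost > budget:
                continue  # already selected, or it would break the constraint
            gain = positive_region_size(rows, chosen + [feature], decision) - base
            score = gain * cost ** lam
            if score > best_score:
                best, best_score = feature, score
        if best is None:  # nothing affordable improves the positive region
            return chosen
        chosen.append(best)
        spent += costs[best]


def competition(rows, costs, decision, budget, lambdas):
    """Run the heuristic once per lambda; keep the largest positive region."""
    candidates = [weighted_greedy(rows, costs, decision, budget, lam)
                  for lam in lambdas]
    return max(candidates, key=lambda b: positive_region_size(rows, b, decision))


# Hypothetical toy decision table: conditional features a, b, c; decision d.
rows = [
    {"a": 0, "b": 0, "c": 0, "d": 0}, {"a": 0, "b": 0, "c": 1, "d": 0},
    {"a": 0, "b": 0, "c": 0, "d": 0}, {"a": 0, "b": 1, "c": 0, "d": 0},
    {"a": 1, "b": 1, "c": 1, "d": 1}, {"a": 1, "b": 1, "c": 1, "d": 1},
    {"a": 1, "b": 2, "c": 0, "d": 1}, {"a": 1, "b": 2, "c": 1, "d": 0},
]
costs = {"a": 4, "b": 2, "c": 3}           # hypothetical test costs
lambdas = [-0.25 * k for k in range(13)]   # 0, -0.25, ..., -3

print(weighted_greedy(rows, costs, "d", budget=5, lam=0))
print(competition(rows, costs, "d", budget=5, lambdas=lambdas))
```

On this toy table the non-weighting run (λ = 0) spends most of the budget on the expensive feature a and stops, whereas a sufficiently negative λ prefers the cheap feature b and can then still afford c. The competition simply keeps the better of the resulting subsets, so no single λ has to be chosen in advance.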



5. The CSP viewpoint to feature selection 

Problems 3, 4, 6 and 8 provide the CSP viewpoint to feature selection. Most 
existing feature selection problems in rough sets can be viewed as extensions 
of Problem 4 in one or more of the following aspects: input, output, 
constraint, and optimization objective. We analyze them from each aspect as 
follows. 

Figure 1: The probability of finding the optimal feature subset for given λ 



First, there are some extensions concerning the input data model. Since the 
data model is essential, these extensions often require extensions of the Pawlak 
rough set. 

1. Some conditional features are numeric. Numeric data are quite different 
from symbolic ones, which are employed in Pawlak rough sets [43]. 
Coverings, instead of partitions, can be formed according to feature sets. 
Covering-based rough sets [BU (701 [TTJ [53] deal with reduction of coverings. 
The neighborhood rough set model (T5J EH HZ1 EE] generates neighborhood 
systems on such data. 

2. The data are uncertain. The uncertainty of data may be caused by noise, 
observational error, etc. [5]. The error range based covering rough set 
model was proposed to deal with observational error. Another well-known 
data model is interval-valued fuzzy sets [10], which have been 
studied through rough sets [§]. 

3. There is external information on features and feature subsets [62]. Some 
of this information is subjective and can be expressed by user preference. 
For example, features are ranked by the user, or even directly specified by 
an expert [34]. Other information is objective. For example, there is 
a weight or test cost for each feature [33] [52]. There are a number of 
possible extensions to the weight computation of a feature subset. These 
are the additive, average, maximal and minimal extensions [62]. In [32], six 
data models concerning test cost and relationships among features are 
defined. Test-cost-sensitive attribute reduction problems [TH [32] can be 
defined according to these models. 

Figure 2: The probability of finding the optimal feature subset 

4. There is external information on classification. The most widely adopted 
information might be misclassification cost [52] [68]. DTRS models [22], [59] 
[61] consider loss functions concerning different classifications. These 
classifications correspond to positive, negative and boundary rules. There 
are costs for both misclassifications and correct classifications. 

5. There is external information on both conditions and classifications. In 
applications such as clinical systems, both test costs and misclassification 
costs exist [52]. This issue is addressed in [39]. 
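
The four weight extensions named in item 3 (additive, average, maximal, minimal) can be illustrated with a short sketch. The function name, the mode strings and the example costs are our own assumptions; the treatment of the empty subset is also a choice made here for convenience.

```python
def subset_weight(weights, subset, mode="additive"):
    """Aggregate per-feature weights of `subset` under one of four modes."""
    values = [weights[f] for f in subset]
    if not values:
        return 0.0  # assumed convention for the empty feature subset
    aggregators = {
        "additive": sum,
        "average": lambda v: sum(v) / len(v),
        "maximal": max,
        "minimal": min,
    }
    return aggregators[mode](values)


test_costs = {"a": 4, "b": 2, "c": 3}  # hypothetical test costs
for mode in ("additive", "average", "maximal", "minimal"):
    print(mode, subset_weight(test_costs, ["a", "b"], mode))
```

The additive form corresponds to tests performed one after another, while the maximal form would suit, for instance, a time cost when tests run in parallel; which form applies depends on the data model.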

Second, there are some extensions concerning the output. Researchers have 
considered generalized reduct problems, such as attribute value reduction 
[33], discretization [41], and symbolic value partition [33]. Since features 
are transformed or combined, these problems should be called feature 
extraction instead [12] [19]. 

Third, there are some extensions concerning the constraint. Many of them 
are still expressed in the same form as Problem 4. However, the definitions 
of the positive region are different due to the change of the input data model. 
Others are expressed in different forms. 

1. The computation of the positive region follows DTRS models [57] [59] [61]. 
In DTRS, three threshold parameters are used to define positive regions. They 
are in turn computed based on a set of loss functions according to the 
Bayesian decision procedure. The major advantage is that the parameters are 
not set subjectively by the user. Therefore the models have good semantics 
and wide applications. 

2. The computation of the positive region follows the variable precision rough 
set model [75], or the Bayesian rough set model [23]. There is a 
user-specified parameter β to indicate the admissible classification error. 
Pawlak rough sets can be viewed as a special case of variable precision rough 
sets where β = 0. This extension has inspired fruitful research concerning 
probabilistic rough sets [26]. The β-lower distribution and β-upper 
distribution [25] have been studied more closely. 

3. The computation of the positive region follows the neighborhood rough 
set model [16] [17] [18] or the error range based covering rough set model 
[39]. In the neighborhood rough set model [16] [17] [18], positive regions 
also rely on a user-specified parameter δ, which is the distance threshold. 
In the error range based covering rough set model [39], positive regions 
also rely on error ranges of data. Error ranges are determined by testing 
instruments and therefore they are objective. 

4. The constraint is conditional information entropy [48] [53] [27]. It is 
expressed by H(B|{d}) = H(C|{d}), where H(B|{d}) denotes the conditional 
information entropy of B with respect to d. The conditional information 
entropy constraint is stricter than the positive region constraint. 
That is, a feature subset meeting the positive region constraint may 
not meet the conditional information entropy constraint, whereas any feature 
subset meeting the conditional information entropy constraint also meets the 
positive region constraint. These two constraints are equivalent if and only 
if the decision system is consistent [3j|] . 
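
The relation between the two constraints in item 4 can be checked on a small inconsistent decision table. The sketch below assumes the usual Shannon conditional entropy of the decision given the partition induced by a feature set; the entropy definitions used in the cited works may differ, and the table and function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def decision_blocks(rows, features, decision):
    """Group decision values by equivalence class on `features`."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[f] for f in features)].append(row[decision])
    return list(blocks.values())


def positive_region_size(rows, features, decision):
    """Count objects lying in blocks with a single decision value."""
    return sum(len(b) for b in decision_blocks(rows, features, decision)
               if len(set(b)) == 1)


def conditional_entropy(rows, features, decision):
    """Shannon entropy of the decision given the partition by `features`."""
    n = len(rows)
    h = 0.0
    for block in decision_blocks(rows, features, decision):
        for count in Counter(block).values():
            p = count / len(block)
            h -= (len(block) / n) * p * log2(p)
    return h


# Hypothetical inconsistent table: objects with equal (a, b) but different d.
rows = [
    {"a": 0, "b": 0, "d": 0}, {"a": 0, "b": 0, "d": 1},
    {"a": 0, "b": 1, "d": 1}, {"a": 0, "b": 1, "d": 1},
    {"a": 0, "b": 1, "d": 0},
    {"a": 1, "b": 0, "d": 0}, {"a": 1, "b": 1, "d": 0},
]

for subset in (["a", "b"], ["a"]):
    print(subset,
          positive_region_size(rows, subset, "d"),
          round(conditional_entropy(rows, subset, "d"), 4))
```

Here the subset {a} keeps the positive region of {a, b} but has a strictly larger conditional entropy, so it satisfies the positive region constraint and violates the entropy constraint.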

Fourth, there are some extensions concerning the optimization objective. 

1. Minimize the cost. In test cost sensitive decision systems, the objective 
is to minimize the total test cost [31]. In misclassification cost sensitive 
decision systems, the objective is to minimize the average misclassification 
cost [23] [SH [ST], or the risk [53] [55]. In decision systems with both test 
cost and misclassification cost, the objective is to minimize the total cost. 

2. Minimize the feature space. For the minimal reduct problem, 
features with more values are more likely to be selected. These features, 
however, have weaker generalization ability than features with fewer values. 
The new objective can help amend this drawback. When the domains of 
features have the same size, the new objective coincides with Problem 4. 

3. Maximize the stability. Dynamic reducts [1] are stable in the process of 
decision table sampling. Decision rules computed from dynamic reducts 
are more reliable. Parallel reducts [7] follow the same idea. 

4. Maximize the margin. A margin is a geometric measure for evaluating 
the confidence of a classifier with respect to its decision [U [S]. Unlike 
other metrics such as the positive region or conditional information entropy, 
this measure is not monotonic. That is, it may increase or decrease when 
more features are selected. 

Most problems mentioned above are no longer reduct problems. When the 
input is changed, the indiscernibility relation may not exist. One can only 
consider weaker relations such as the similarity relation [50]. When the 
constraint is changed, the positive region is not computed, or is not 
computed in the Pawlak approach (see, e.g., [TTl [39]). A reduct subject to 
the conditional information entropy constraint may not be a Pawlak reduct. 
When the optimization objective is changed, the optimal reduct may not be 
minimal. A feature subset with the minimal total cost [3H| may not be a 
reduct at all. 

From these extensions, many meaningful new problems can be identified. A 
few of them are listed as follows. 

1. Feature selection under DTRS with test cost. Note again that the external 
information in DTRS cannot be expressed by a misclassification matrix. 
Test cost is also external information. By considering more external 
information, the problem becomes more interesting and challenging. 

2. Feature selection with positive region constraint. To have an even simpler 
representation, we may require the positive region to be preserved to a 
certain degree. For example, the feature subset should have a positive 
region of more than 95% of the original. Note that this problem is different 
from the variable precision rough set model [72], where the definition of 
the positive region is changed. Their motivations are, however, quite similar 
in that they both deal with the overfitting issue. 

3. Minimal test cost feature selection with positive region constraint. This 
problem differs from the last one in that the objective is to find a feature 
subset with the least cost. It is a hybrid of the last problem and the MTR 
problem. It can also be viewed as a dual problem of the FSTC problem. 
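
To make the last two problems concrete, the following sketch searches exhaustively for the cheapest feature subset whose positive region retains at least a given fraction of the original one. It is an illustration only: the exhaustive search, the use of "at least" rather than "more than", the toy table and the costs are assumptions, and the approach is practical only for very small feature sets.

```python
from collections import defaultdict
from itertools import combinations


def positive_region_size(rows, features, decision):
    """Count objects whose equivalence class on `features` has one decision."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[f] for f in features)].append(row[decision])
    return sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)


def min_cost_subset(rows, costs, decision, ratio):
    """Cheapest subset keeping at least `ratio` of the full positive region."""
    features = list(costs)
    target = ratio * positive_region_size(rows, features, decision)
    best, best_cost = None, float("inf")
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            cost = sum(costs[f] for f in subset)
            if (cost < best_cost
                    and positive_region_size(rows, subset, decision) >= target):
                best, best_cost = list(subset), cost
    return best, best_cost


# Hypothetical toy decision table: conditional features a, b, c; decision d.
rows = [
    {"a": 0, "b": 0, "c": 0, "d": 0}, {"a": 0, "b": 0, "c": 1, "d": 0},
    {"a": 0, "b": 0, "c": 0, "d": 0}, {"a": 0, "b": 1, "c": 0, "d": 0},
    {"a": 1, "b": 1, "c": 1, "d": 1}, {"a": 1, "b": 1, "c": 1, "d": 1},
    {"a": 1, "b": 2, "c": 0, "d": 1}, {"a": 1, "b": 2, "c": 1, "d": 0},
]
costs = {"a": 4, "b": 2, "c": 3}  # hypothetical test costs

print(min_cost_subset(rows, costs, "d", ratio=0.95))
print(min_cost_subset(rows, costs, "d", ratio=0.5))
```

On this table a 95% requirement still forces the subset {b, c}, while relaxing the requirement to 50% lets the search settle for the single feature a at a lower cost, which shows how the degree of preservation trades off against test cost.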

Some of these problems are new combinations of existing extensions, while 
others involve new extensions. We observe that the number of possible 
combinations is large, and many of them have certain application areas. In 
other words, many research issues are opened up by the CSP viewpoint. 

6. Conclusions and further works 

This paper proposed a new feature selection problem concerning the test cost 
constraint. The new problem, called FSTC, has a wide application area because 
the resources one can afford are often limited. Both backtracking and heuristic 
algorithms were designed for it. Experimental results showed the efficiency of 
the backtracking algorithm compared with existing ones, and the effectiveness 
of the competition strategy based on the λ-weighted heuristic algorithm. It 
should be noted that with the competition strategy, we do not have to know the 
optimal setting of λ. Instead, we can specify a set of λ values which are 
valid for any dataset. 

A more important contribution of the paper is the CSP viewpoint to feature 
selection in rough sets. From this viewpoint, most feature selection problems are 
natural generalizations of the minimal reduct problem. This viewpoint helps us 
to identify some other meaningful problems from the following aspects: input, 
output, constraint, and optimization objective. In summary, this paper has 
indicated important research trends concerning feature selection beyond rough 
sets. 